STAT 331
subset(surveys, subset = species_id == "DS")
https://docs.google.com/presentation/d/19u5djgMsPLtxoM-rfAQuLyP89B4nYdW8uBhNAEJQoS8/edit?usp=sharing
subset()
Return subsets of vectors, matrices or data frames which meet conditions.
We want functions that accomplish one task!
We want functions with intuitive names!
filter()
select()
mutate()
summarize()
arrange()
group_by()
Brainstorm definitions for each verb
filter()
select()
mutate()
group_by()
summarize()
arrange()
The Pipe |>
Suppose we would like to study how the ratio of penguin body mass to flipper size differs across the species. Arrange the following steps into an order that accomplishes this goal (assuming the steps are connected with a |>).
arrange(med_mass_flipper_ratio)
group_by(species)
penguins
summarize(med_mass_flipper_ratio = median(mass_flipper_ratio))
mutate(mass_flipper_ratio = body_mass_g / flipper_length_mm)
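One possible ordering of the steps above, written as a single pipeline. This is a minimal sketch: the small `penguins` data frame below is hypothetical stand-in data (not the real penguins data set), and the name `ratio_by_species` is introduced only for illustration.

```r
library(dplyr)

# Hypothetical stand-in data with the three columns the steps rely on
penguins <- data.frame(
  species           = c("Adelie", "Adelie", "Gentoo", "Gentoo", "Chinstrap", "Chinstrap"),
  body_mass_g       = c(3700, 3800, 5000, 5200, 3960, 4180),
  flipper_length_mm = c(185, 190, 200, 208, 180, 190)
)

ratio_by_species <- penguins |>
  # 1. Create the ratio for every penguin
  mutate(mass_flipper_ratio = body_mass_g / flipper_length_mm) |>
  # 2. Declare that later summaries are computed per species
  group_by(species) |>
  # 3. Collapse each species to one row holding its median ratio
  summarize(med_mass_flipper_ratio = median(mass_flipper_ratio)) |>
  # 4. Sort species from smallest to largest median ratio
  arrange(med_mass_flipper_ratio)

ratio_by_species
```

The new column has to exist before it can be summarized, and the grouping has to be declared before the summary, which is why `mutate()` and `group_by()` come before `summarize()`. With real data that contains missing values, `median()` would likely need `na.rm = TRUE`.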
You have data on each Cal Poly student for the 2020-2021 academic year. You are tasked with reporting how the number of CR/NC courses students take differs based on department.
| name | department | CRNC_f20 | CRNC_w21 | CRNC_s21 |
|---|---|---|---|---|
| Gonzales, Yasmin | Business | 1 | 3 | 0 |
| al-Hossain, Misbaah | Biology | 2 | 2 | 1 |
| Hyland, Cassidy | Liberal Studies | 0 | 0 | 1 |
| Landry, Conner | Political Science | 2 | 0 | 0 |
| Lai Zhou, Meghan | Business | 0 | 0 | 2 |
| Navarrete, Guadalupe | Business | 0 | 1 | 2 |
| Yahashi, Hannah | Liberal Studies | 0 | 1 | 0 |
| Mcbroom, Gabrielle | Biology | 1 | 1 | 1 |
| Hepp, Kayla | Political Science | 0 | 2 | 4 |
| Yost, Aubrey | Chemistry | 1 | 0 | 2 |
What data wrangling operations would you use?
What order would you use to accomplish this goal?
Problem Statement:
Department totals for number of CR/NC courses
Step 1: Get totals for each student
Step 2: Get department totals
Step 3: Arrange the totals
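The three steps can be sketched with dplyr as follows. The data frame reproduces the table of students shown earlier; the names `crnc`, `CRNC_total`, `dept_total`, and `dept_totals` are assumptions made for this example, and sorting from largest to smallest is one reasonable choice rather than the only one.

```r
library(dplyr)

# The ten students from the table above
crnc <- data.frame(
  name = c("Gonzales, Yasmin", "al-Hossain, Misbaah", "Hyland, Cassidy",
           "Landry, Conner", "Lai Zhou, Meghan", "Navarrete, Guadalupe",
           "Yahashi, Hannah", "Mcbroom, Gabrielle", "Hepp, Kayla", "Yost, Aubrey"),
  department = c("Business", "Biology", "Liberal Studies", "Political Science",
                 "Business", "Business", "Liberal Studies", "Biology",
                 "Political Science", "Chemistry"),
  CRNC_f20 = c(1, 2, 0, 2, 0, 0, 0, 1, 0, 1),
  CRNC_w21 = c(3, 2, 0, 0, 0, 1, 1, 1, 2, 0),
  CRNC_s21 = c(0, 1, 1, 0, 2, 2, 0, 1, 4, 2)
)

dept_totals <- crnc |>
  # Step 1: total CR/NC courses for each student across the three quarters
  mutate(CRNC_total = CRNC_f20 + CRNC_w21 + CRNC_s21) |>
  # Step 2: add up the student totals within each department
  group_by(department) |>
  summarize(dept_total = sum(CRNC_total)) |>
  # Step 3: order departments from most to fewest CR/NC courses
  arrange(desc(dept_total))

dept_totals
```

Because departments differ in size, a per-student average (for example `mean(CRNC_total)`) may be a fairer comparison than a raw total.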
Often you are interested in one specific summary statistic!
# A tibble: 5 × 2
# Groups:   department [5]
  department            n
  <chr>             <int>
1 Business              3
2 Biology               2
3 Liberal Studies       2
4 Political Science     2
5 Chemistry             1
# A tibble: 1 × 2
# Groups:   department [1]
  department            n
  <chr>             <int>
1 Political Science     2
pull()
[1] 2
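A sketch of how output like this might be produced: count the students in each department, keep one department's row, then use `pull()` to extract the count as a plain vector instead of a one-row table. The `students` data frame and the object names are assumptions for this example. `count()` is used here as a shortcut, so the result is ungrouped, unlike the grouped tibbles printed above.

```r
library(dplyr)

# Department of each of the ten students in the earlier table
students <- data.frame(
  department = c("Business", "Biology", "Liberal Studies", "Political Science",
                 "Business", "Business", "Liberal Studies", "Biology",
                 "Political Science", "Chemistry")
)

# Number of students per department, largest first
dept_counts <- students |>
  count(department, sort = TRUE)

# Keep one department, then extract the n column as a vector
polisci_n <- dept_counts |>
  filter(department == "Political Science") |>
  pull(n)

polisci_n
```

`pull()` is handy when the value feeds into later arithmetic or inline text, where a one-row data frame would be awkward to work with.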